Finding Centuries-Old Hyperlinks: a Novel Semi-Supervised Shape Classifier
نویسندگان
چکیده
Hyperlinks are so useful for searching and browsing modern digital collections that researchers have longer wondered if it is possible to retroactively add hyperlinks to digitized historical documents. There has already been significant research into this endeavor for historical text; however, in this work we consider the problem of adding hyperlinks among graphic elements. While such a system would not have the ubiquitous utility of text-based hyperlinks, as we will show, there are several domains where it can significantly augment textual information. While OCR of historical text is known to be a difficult problem, the actual words themselves are inherently discrete. Thus, two words are either identical or not. This means that off-the-shelf machine learning algorithms, including semisupervised learning, can be easily used. However, as we shall demonstrate, semi-supervised learning does not work well with images, because we cannot expect binary matching decisions. Rather we must deal with degrees of matching. In this work we make the novel observation that this “degree of matching” biased algorithms make overly confident predictions about simple shapes. We show that a simple technique for correcting this bias, and demonstrate through extensive experiments that our method significantly improves accuracy on diverse historical image collections. Keywords-Historical Manuscripts, Hyperlinks, SemiSupervised Learning
منابع مشابه
Semi-Supervised Learning Based Prediction of Musculoskeletal Disorder Risk
This study explores a semi-supervised classification approach using random forest as a base classifier to classify the low-back disorders (LBDs) risk associated with the industrial jobs. Semi-supervised classification approach uses unlabeled data together with the small number of labelled data to create a better classifier. The results obtained by the proposed approach are compared with those o...
متن کاملSemi-supervised Learning Using an Unsupervised Atlas
In many machine learning problems, high-dimensional datasets often lie on or near manifolds of locally low-rank. This knowledge can be exploited to avoid the “curse of dimensionality” when learning a classifier. Explicit manifold learning formulations such as lle are rarely used for this purpose, and instead classifiers may make use of methods such as local co-ordinate coding or auto-encoders t...
متن کاملSemi-supervised learning for improved expression of uncertainty in discriminative classifiers
Seeking classifier models that are not overconfident and that better represent the inherent uncertainty over a set of choices, we extend an objective for semi-supervised learning for neural networks to two models from the ratio semi-definite classifier (RSC) family. We show that the RSC family of classifiers produces smoother transitions between classes on a vowel classification task, and that ...
متن کاملDiscriminative Similarity for Clustering and Semi-Supervised Learning
Similarity-based clustering and semi-supervised learning methods separate the data into clusters or classes according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose a novel discriminative similarity learning framework which learns discriminative similarity for either data clustering or semi-supervised learning...
متن کاملMinimally-supervised methods for Arabic Named Entity Recognition
Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semisupervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-...
متن کامل